Avoiding test set bias with rank-based prediction
Abstract
Background: Before genomic predictors can be applied to clinical samples, the genomic data must be properly normalized. The most effective normalization methods depend on data from multiple patients. From a biomedical perspective, this implies that predictions for a single patient may change depending on which other patient samples they are normalized with. This test set bias will occur whenever cross-sample normalization is used before clinical prediction. Methods: We developed a new prediction modeling framework based on the relative ranks of features within a sample, which removes the need for cross-sample normalization and thereby avoids test set bias. We used the previously published Top-Scoring Pairs (TSPs) methodology to build the rank-based predictors. We further investigated the robustness of the rank-based models on heterogeneous datasets generated with diverse microarray technologies. Results: We demonstrated that results from existing genetic signatures that rely on normalizing test data may be unreproducible when the patient population changes in composition or size. Using pairwise comparisons of features, we produced a ten-gene, platform-robust, and interpretable alternative to the PAM50 subtyping signature and evaluated the robustness of our signature across 6,297 patient samples from 28 curated breast cancer microarray datasets spanning 15 different platforms. Conclusion: We propose a new approach to developing genomic signatures that avoids test set bias through the robustness of rank-based features. Our small, interpretable alternative to PAM50 produces predictions and patient survival differentiation comparable to those of the original signature. Additionally, we are able to ensure that the same patient will be classified the same way in every context.
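The contrast the abstract draws can be illustrated with a toy sketch. It is not the authors' signature or their TSP implementation: the gene indices, the three pairs, the majority-vote rule, and the z-score threshold classifier are all hypothetical choices made only to show why a within-sample rank rule gives the same call in any cohort, while a rule that normalizes across samples may not.

```python
import numpy as np

def pair_vote_predict(sample, pairs):
    """Classify one sample using only within-sample comparisons.

    Each (i, j) pair votes for class 1 when sample[i] > sample[j];
    the majority decides. No other samples are consulted.
    (Simplified stand-in for a top-scoring-pairs style rule.)
    """
    votes = sum(1 for i, j in pairs if sample[i] > sample[j])
    return int(2 * votes > len(pairs))

def zscore_predict(cohort, idx, gene=0):
    """Classify sample `idx` after z-scoring one gene across the cohort.

    This is a hypothetical cross-sample normalization rule: the call
    depends on the mean and spread of the other samples.
    """
    col = cohort[:, gene]
    z = (col - col.mean()) / col.std()
    return int(z[idx] > 0)

# One patient (row 0) placed in two different hypothetical cohorts.
patient = np.array([5.0, 3.0, 2.0, 4.0])
pairs = [(0, 1), (3, 2), (1, 2)]  # hypothetical gene pairs

cohort_low = np.vstack([patient, [1.0, 2.0, 3.0, 1.0], [2.0, 1.0, 4.0, 2.0]])
cohort_high = np.vstack([patient, [8.0, 2.0, 3.0, 1.0], [9.0, 1.0, 4.0, 2.0]])

# The rank-based call only looks at the patient's own row.
rank_calls = [pair_vote_predict(c[0], pairs) for c in (cohort_low, cohort_high)]
# The normalized call depends on who else is in the cohort.
z_calls = [zscore_predict(c, 0) for c in (cohort_low, cohort_high)]

print("rank-based calls:", rank_calls)
print("z-score calls:   ", z_calls)
```

In this toy setup the pairwise rule returns the same class for the patient in both cohorts, whereas the z-score rule flips once the accompanying samples change. The pairwise rule is also unchanged by any strictly increasing transformation of the patient's values (for example a log transform), since such a transformation preserves within-sample ranks.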
Related resources
Prediction of Suicide Ideation Based on the Attentional Bias in Clinical and Non-clinical Populations
Objectives: This study aimed to predict the suicide ideation based on the attentional bias in clinical and non-clinical populations. Methods: Participants were 120 individuals (77 women and 43 men, age range 18-40 years) who were purposively selected and divided into three groups of clinical-suicidal (n=40), clinical non-suicidal (n=40) and non-clinical (n=40). They were measured by Suicide St...
Correcting the optimally selected resampling-based error rate: A smooth analytical alternative to nested cross-validation
High-dimensional binary classification tasks, e.g. the classification of microarray samples into normal and cancer tissues, usually involve a tuning parameter adjusting the complexity of the applied method to the examined data set. By reporting the performance of the best tuning parameter value only, over-optimistic prediction errors are published. The contribution of this paper is twofold. Fi...
Prediction of Severity of Delusion Based on Jumping-to-Conclusion Bias in Schizophrenia Patients
Objectives: New cognitive theories of delusions have proposed that deficit or bias in inference stage (a stage of normal belief formation) is significant in delusion formation. The aim of this study was predicting the severity of delusions based on jumping-to-conclusion bias in patients with schizophrenia. Methods: The sample consisted of 60 deluded patients with schizophrenia w...
Unbiased Assessment of Learning Algorithms
In order to rank the performance of machine learning algorithms, many researchers conduct experiments on benchmark data sets. Since most learning algorithms have domain-specific parameters, it is a popular custom to adapt these parameters to obtain a minimal error rate on the test set. The same rate is then used to rank the algorithm, which causes an optimistic bias. We quantify this bias, sho...
Artificial Neural Networks as an Aid to Medical Decision Making: Comparing a Statistical Resampling Technique with the Train-and-Test Technique for Validation of Sparse Data Sets
The volume of computer-generated patient data currently available to the health care practitioner must be managed effectively to be of practical use in enhancing the treatment of the patient. Not only might he or she be overwhelmed by this data, but certain legal consequences are a threat if the data is available but underutilized or mis-utilized. One method for managing this data is by the use...
Publication date: 2014